Introduction

In this tutorial, we will fit Bayesian distance sampling-adjusted log-Gaussian Cox process models to the survey data using the inlabru package. This will provide us with the prerequisite knowledge needed to fit integrative joint log-Gaussian Cox process models with both survey data and opportunistic sightings data reported by whale watch companies!

Load packages

Let’s load all the required R packages. See our tutorial 0_Installing_packages.html for installation information.

library(rgeos)
library(rgdal)
library(sp)
library(maptools)
library(spatstat)
library(spdep)
library(INLA)
library(inlabru)
library(readxl)
library(lubridate)
library(ggmap)
library(raster)

File set up

Using getwd(), check that you are inside the folder ‘DFO_SDM_Workshop_2020’ that we downloaded in the previous tutorial. If not, change this by navigating to the correct folder in the bottom-right panel (folder view), opening it, and then clicking “Set As Working Directory” under the tab ‘More’.

Now we can load in the precompiled data and functions

list2env(readRDS('./Data/Modelling_Data.rds'),globalenv())
## <environment: R_GlobalEnv>
source('utility_functions.R')

Finally, let’s turn off all warnings associated with coordinate reference systems.

rgdal::set_rgdal_show_exportToProj4_warnings(FALSE)
rgdal::set_thin_PROJ6_warnings(TRUE)
options("rgdal_show_exportToProj4_warnings"="none")

Defining the components of a log-Gaussian Cox process model

Creating a mesh

Log-Gaussian Cox process models require the specification of Gaussian random fields. A Gaussian random field is a random spatial surface whose values at any finite collection of locations jointly follow a multivariate normal distribution. Fitting Gaussian random fields directly can become computationally prohibitive. We will use a smart computational trick, which represents the Gaussian random field as the solution to a stochastic partial differential equation (SPDE), approximated by a Gaussian Markov random field (GMRF) evaluated across a Delaunay triangulation mesh. The details of the SPDE are not essential here; it is enough to know that it is the trick that makes these models computationally feasible. This approach was developed by Lindgren and Rue (2011) (Lindgren being an author of the inlabru package).

The first step to using this approach is creating the mesh. Good rules of thumb are as follows:

  • Define maximum triangle lengths no shorter than half of the expected spatial range of your problem. Spatial range here is loosely defined as the ‘distance required between two locations for the whale intensity at these two locations to be approximately independent of each other’. Note that it is impossible to detect clusters or hotspots with sizes smaller than the maximum length of triangles used.

  • Do not define the minimum triangle lengths too small. This could lead to very large meshes! This is specified with the argument cutoff.

  • Use a minimum triangle angle of 21 degrees.

We define a first mesh using the function inla.mesh.2d from the INLA package. We choose a minimum triangle edge length of 18km, a maximum triangle edge length of 45km, a minimum angle of 25 degrees, and we specify that we want a hard boundary equal to the coastline (as defined by the SpatialPolygonsDataFrame Domain). Note that we can plot meshes using gg as before.

mesh_land <- inla.mesh.2d(boundary = Domain,
                               min.angle = 25,
                               max.edge = 45,
                               cutoff = 18)
ggplot() + gg(mesh_land) + gg.spatiallines_mod(Effort_survey,colour='red')

Note that spatial correlations < 18km will not be detectable! We can print the number of mesh vertices (directly related to the length of time required later to fit the model).

mesh_land$n # lots of triangle vertices can lead to slow computation time
## [1] 468

We have a potentially serious issue with this mesh. The coastline is jagged, which could lead to numerical instabilities later. One easy solution is to buffer and/or smooth the coastline shapefile, making sure that all survey lines and sightings fall within the new modified shapefile. We did this in the exploratory analysis, creating the Domain_restricted object!

ggplot() + gg(Domain_restricted) + 
  gg.spatiallines_mod(Effort_survey,colour='red')

gContains(gBuffer(Domain_restricted, width=5), Effort_survey)
## [1] TRUE
# # Buffered by 5km
# ggplot() + gg(gBuffer(Domain, width=5)) + gg.spatiallines_mod(Effort_survey,colour='red')
# gContains(gBuffer(Domain, width=5), Effort_survey)
# # Doesn't contain all the survey lines
# 
# # Buffered by 10km
# ggplot() + gg(gBuffer(Domain,width=10)) + gg.spatiallines_mod(Effort_survey,colour='red')
# gContains(gBuffer(Domain,width=10), Effort_survey) 
# # Again, doesn't contain all the survey lines
# 
# # Buffered by 15km
# ggplot() + gg(gBuffer(Domain,width=15)) + gg.spatiallines_mod(Effort_survey,colour='red')
# gContains(gBuffer(Domain,width=15), Effort_survey) 
# # Success, does contain all the survey lines!

Using this new buffered and smoothed domain/study area shapefile (i.e. Domain_restricted), we create below a new mesh, again named mesh_land.

mesh_land <- inla.mesh.2d(boundary = Domain_restricted,
                         min.angle = 25,
                         max.edge = 45,
                         cutoff = 18)
mesh_land$crs <- Can_proj
mesh_land$n
## [1] 456
ggplot() +
  gg(mesh_land) + 
  gg.spatiallines_mod(Effort_survey,colour='red')

One may be tempted to define the mesh within a much larger convex hull around the coastline, but see Additional tips for explanation of why this is an inadequate approach.

Defining Gaussian random field and priors

Next, we need to define our Gaussian random field model and its priors. To do this, we use the function inla.spde2.pcmatern from the INLA package. Note that the arguments prior.sigma and prior.range define arguments for the penalized complexity (PC) prior of the random field developed by Simpson et al. (2017) and Fuglstad et al. (2019).

We are required to define PC priors, on the spatial range \(\rho\) (the wiggliness) and marginal standard deviation \(\sigma\) (the magnitude) of the field. PC priors take the form of prior probabilistic statements on quantiles, providing a user-friendly way of defining sensible, interpretable priors.

Before defining our priors, let’s remind ourselves how the Gaussian random field is affected by the range and marginal standard deviation. Below is a chunk of code defining a function (GRF_sim()) that simulates and plots a single Gaussian random field on the unit square. You are asked to specify the range and marginal standard deviation:

# Specify spatial range and variance
sim_range <- 0.1
sim_sd <- 2
# Function for simulating and plotting a GRF on a unit square
GRF_sim <- function()
{
  # define inla mesh
  mesh <- inla.mesh.2d(loc.domain = cbind(c(0,1,1,0),c(0,0,1,1)),
                       min.angle = 21, max.edge = 0.04, cutoff = 0.02)
  plot(mesh)
  mesh$n
  
  # define spde
  spde_obj <- inla.spde2.pcmatern(mesh, constr = T,
                                  prior.range = c(sim_range, 0.5),
                                  prior.sigma = c(sim_sd, 0.5))
  
  # define indices
  index <- inla.spde.make.index('smooth', n.spde = spde_obj$n.spde)
  
  # define prediction matrix
  A_proj_pred <- inla.spde.make.A(mesh, loc = mesh$loc[,c(1,2)])
  
  # simulate GRF from its precision matrix Q
  Q <- inla.spde2.precision(spde_obj, theta = c(log(sim_range), log(sim_sd)))
  field <- inla.qsample(Q = Q)
  field <- field - mean(field)
  
  # define high-resolution pixels on the unit square for plotting
  plot_pixels <- pixels(mesh, mask = as(owin(),'SpatialPolygons'), nx = 150, ny = 150)
  
  # project the simulated field values onto the high-res grid
  A_proj_grid <- inla.spde.make.A(mesh, loc = plot_pixels@coords)
  plot_pixels$field <- as.numeric(A_proj_grid %*% field)
  
  # define a colour scale
  colsc <- function(...) {
    scale_fill_gradientn(colours = rev(RColorBrewer::brewer.pal(11,"RdYlBu")),
                         limits = range(..., na.rm = TRUE))
  }
  
  ggplot() + gg(plot_pixels) + colsc(plot_pixels$field) +
    ggtitle(paste0('SD = ', sim_sd, '  Range = ', sim_range))  
}

To simulate a field with the specified SD and range, run

GRF_sim()

With the understanding of what the SD and range define, we are asked to define the following two probabilistic statements (remember we are working in units of km):

\[P(\rho < \rho_0) = p_{\rho} \\ P(\sigma > \sigma_0) = p_{\sigma}.\] The goal is to specify the lower tail quantile \(\rho_0\) and probability \(p_{\rho}\) for the range parameter and the upper tail quantile \(\sigma_0\) and probability \(p_{\sigma}\) for the marginal standard deviation. We choose these 4 parameters to help ensure that the field does not become too ‘complex’; specifically, we want to avoid the field becoming too ‘wiggly’ or ‘large’. The PC prior shrinks the spatial field towards the base model of zero variance (i.e. no spatial effect), hence the name ‘Penalized Complexity’.
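To see what the statement on \(\sigma\) implies, here is a minimal base-R sketch, assuming (per Fuglstad et al. 2019) that the PC prior on the marginal standard deviation is an exponential distribution with rate \(\lambda = -\textrm{log}(p_{\sigma})/\sigma_0\); the values of \(\sigma_0\) and \(p_{\sigma}\) are the ones we use below:

```r
# Sketch: the PC prior on sigma is exponential, with rate chosen so that
# P(sigma > sigma_0) = p_sigma holds exactly (Fuglstad et al. 2019).
sigma_0 <- 2       # upper tail quantile used below in inla.spde2.pcmatern
p_sigma <- 0.01    # tail probability
lambda  <- -log(p_sigma) / sigma_0
# Exceedance probability of an exponential at sigma_0 recovers p_sigma:
exp(-lambda * sigma_0)
## [1] 0.01
```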

We now define our model, with \(\rho_0\), \(p_{\rho}\), \(\sigma_0\), and \(p_{\sigma}\) fixed at 35km, 0.01, 2, and 0.01 respectively.

# First define the SPDE model
matern_land <- inla.spde2.pcmatern(mesh_land, 
                              prior.sigma = c(2, 0.01), 
                              prior.range = c(35, 0.01))

Defining the detection probability function

Next, we must define our detection probability function. First, we define a half-normal detection probability function. This object takes the form of a base R function. Remember that we have set all values of distance < 250m equal to 250m. Thus, under the assumption that a whale at a distance of 250m from the aircraft is detected with probability 1, we need to subtract 0.25km from our distance value (to center it at 0)!

# Define a half-normal detection probability function (code from inlabru tutorials)
log_hn = function(distance, lsig){ 
  -0.5*((distance-0.25)/exp(lsig))^2
}
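As a quick sanity check (a sketch only; the lsig value below is a hypothetical placeholder, not a fitted estimate), we can evaluate the implied detection probability at a few distances:

```r
# log_hn as defined above, repeated here so this chunk is self-contained
log_hn = function(distance, lsig){
  -0.5*((distance-0.25)/exp(lsig))^2
}
lsig_guess <- -0.27  # hypothetical value of lsig, for illustration only
exp(log_hn(0.25, lsig_guess))  # detection probability is exactly 1 at 0.25 km
exp(log_hn(1, lsig_guess))     # and decays half-normally beyond that
```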

Combining the components of the log-Gaussian Cox process model

Next, we need to describe how inlabru should combine the components of the model: the Gaussian random field (GRF), the detection function parameter lsig, and the intercept.

We will give the GRF the name ‘mySpatial’. This name is arbitrary and can be changed. To define ‘mySpatial’ as a GRF, we will point it to the model object we defined earlier (matern_land) and will tell inlabru that the coordinates of the points are the locations at which we wish to evaluate the field. coordinates is a special variable name reserved for the spatial coordinates of a SpatialPointsDataFrame object. The sightings data we will be using are all SpatialPointsDataFrame objects.

The scalar terms Intercept and the detection function parameter lsig are given the special argument 1. This tells inlabru to create a scalar random variable. These variables will be available for every data point.

# Define the components of the model
cmp_land1 <- ~ mySpatial(main = coordinates, model = matern_land) + 
  lsig(1) + Intercept(1)

Note that the components do not define the formula of the model. We must define the formula separately. The formula describes how these components are combined to form the linear predictor. For our log-Gaussian Cox process, the linear predictor will be the model for log \(\lambda_{obs}\).

We now define the formula. On the left-hand side of the formula are the response variables. We have two for a distance sampling dataset. The first is the collection of locations \(Y\). The second is the set of recorded perpendicular distances from the aircraft.

On the right hand side of the formula, we have the GRF mySpatial. Next, we include the scalar parameter lsig. Unlike mySpatial, we include lsig in the linear predictor as an argument to the detection probability function log_hn. Notice that the log_hn function is evaluated at the response variable distance. Finally, we also include the variable Intercept.

Additionally, an offset of log(2) is required due to the unknown direction of the detections (detections can occur on either side of the aircraft). For now, we ignore covariates.

formula_1 = coordinates + distance ~ mySpatial +
  log_hn(distance, lsig) + 
  Intercept + log(2)

For clarity, the linear predictor defined in formula_1 is:

\[\textrm{log}(\lambda_{true}(s)) = \beta_0 + Z(s) \\ \textrm{log}(p(d)) = \textrm{log}(2) + \frac{-(d-0.25)^2}{2\sigma^2}\].

Thus, the expected encounter rate of an animal at location \(s\), a distance \(d\) away from the aircraft is:

\[\lambda_{obs}(s) = \lambda_{true}(s) p(d)\].

Note that \(Z(s)\) is the value of the GRF mySpatial evaluated at position \(s\), and \(\sigma\) is the exponential of lsig.

Defining the components and the formula separately in two stages is what gives inlabru its generalisability. Parameters can be included within a model non-linearly, unlike in most other R packages! For example, lsig was included nonlinearly through our custom function log_hn! Behind the scenes, for nonlinear models, inlabru linearises the model via a Taylor series expansion.

Furthermore, components and formulae can be stacked together for datasets from differing sources, allowing integrative joint models to be fit with relative ease! For example, the same components can be shared across different models, but take different forms in each (e.g. Intercept in model 1, Intercept^2 in model 2, etc.)!

For detailed additional information on the types of model components allowed by inlabru and the linearisation procedure performed, call ?component and see the inlabru package documentation.

Fitting the model

Specifying global options

Before fitting the model, we set the following global options for inlabru.

bru_options_set(
  bru_verbose = TRUE,
  verbose = TRUE,
  control.inla=list(int.strategy='eb'),
  num.threads=2
)

We want both INLA and inlabru to display all messages about the convergence, because these messages can help with diagnosing any issues with the model fitting. To do so, we set verbose arguments to TRUE.

For the purposes of this workshop, we also tell INLA to perform an empirical Bayes analysis (control.inla=list(int.strategy='eb')) instead of a full Bayesian analysis. This speeds up computation dramatically, but fails to propagate the full uncertainties associated with the hyperparameters into inference. This is because these empirical Bayes analyses condition on the single ‘most-likely’ model defined by the posterior modes of the hyperparameters. In contrast, a fully-Bayesian analysis performs Bayesian model averaging by averaging across (possibly infinitely many) models defined by all possible combinations of hyperparameter values, weighted by their posterior probabilities. The discrepancies between these two strategies can be large and, in a publication, int.strategy should be set to auto (the default setting).

The number of threads (similar to cores) can be chosen. The higher the number, the faster the inference due to parallelisation. To see how many threads are available for use, run parallel::detectCores(). Always leave at least 1 core free for running background tasks!

Fit log-Gaussian Cox process model

We are ready! We fit our first log-Gaussian Cox process model in inlabru!

We will fit the model using the like and bru approach because it allows us to easily combine different data sources or data streams into an integrative joint model.

First, we define our likelihood object using the function like. We specify that the SpatialPointsDataFrame Sightings_survey is our data. Remember - it is crucial that the data are of type SpatialPointsDataFrame for the special coordinates variable to work. Next, we specify the domains for both our response variables coordinates and distance. We point coordinates to the spatial mesh mesh_land and point the distance variable to a 1D mesh that we define. Remember that we have set the upper limit to the distance equal to 2km and a lower limit equal to 0.25km.

Next, we point the likelihood to the formula object we created earlier that defines the linear predictor, and specify that a Cox process model is being fitted by setting family='cp'. Note that a HUGE library of likelihoods is available. Calling inla.list.models() provides a list of every possible model, prior and latent effect available. Finally, we provide a sensible (negative) initial value for lsig.

fit = like(data = Sightings_survey,
           samplers = Effort_survey,
           domain = list(
             coordinates = mesh_land,
             distance = INLA::inla.mesh.1d(seq(0.25, 2, length.out = 30))),
           formula = formula_1,
           family = 'cp',
           options = list(bru_initial = list(lsig = -1)))

Next, we call the function bru. This function combines all the likelihood objects together with the specified components and performs model inference. Note that we only have a single likelihood object fit, with components cmp_land1.

Running this function may take 2-5 min.

# Warning: can be slow!
fit <- bru(fit, components = cmp_land1)

Is this a good fit? One simple way of gauging how well a model fits the data is to assess its DIC value. DIC is similar to AIC/BIC in that it trades off goodness-of-fit with a penalty for the complexity of the model, under the Occam’s Razor principle that ‘simple explanations trump complicated ones when both predict outcomes equally well’. Whilst a lone DIC value is somewhat meaningless, we can compare the DIC values of competing models fit to the same data. The lower the DIC score, the better the model fit.

fit$dic$dic #DIC is 814. This is our benchmark
## [1] 814.3605
summary(fit) # WAIC 810
## inlabru version: 2.2.4.9000
## INLA version: 21.01.08-1
## Components:
##   mySpatial: Model types main='spde', group='exchangeable', replicate='iid'
##   lsig: Model types main='linear', group='exchangeable', replicate='iid'
##   Intercept: Model types main='linear', group='exchangeable', replicate='iid'
## Likelihoods:
##   Family: 'cp'
##     Data class: 'SpatialPointsDataFrame'
##     Predictor: 
##         coordinates + distance ~ mySpatial + log_hn(distance, lsig) + 
##             Intercept + log(2)
## Time used:
##     Pre = 5.56, Running = 9.87, Post = 1.24, Total = 16.7 
## Fixed effects:
##             mean    sd 0.025quant 0.5quant 0.975quant   mode kld
## lsig      -0.270 0.118     -0.516   -0.265     -0.052 -0.256   0
## Intercept -6.507 0.346     -7.191   -6.506     -5.832 -6.503   0
## 
## Random effects:
##   Name     Model
##     mySpatial SPDE2 model
## 
## Model hyperparameters:
##                        mean     sd 0.025quant 0.5quant 0.975quant    mode
## Range for mySpatial 201.760 96.132     84.360  178.939     450.03 144.127
## Stdev for mySpatial   0.814  0.199      0.485    0.794       1.26   0.757
## 
## Expected number of effective parameters(stdev): 24.29(0.00)
## Number of equivalent replicates : 479.37 
## 
## Deviance Information Criterion (DIC) ...............: 814.36
## Deviance Information Criterion (DIC, saturated) ....: NaN
## Effective number of parameters .....................: 22.22
## 
## Watanabe-Akaike information criterion (WAIC) ...: 810.16
## Effective number of parameters .................: 17.41
## 
## Marginal log-Likelihood:  -424.53 
## Posterior marginals for the linear predictor and
##  the fitted values are computed

Let’s turn off the (slightly overwhelming) messages from INLA for the remainder of the workshop.

bru_options_set(
  bru_verbose = TRUE,
  verbose = FALSE,
  control.inla=list(int.strategy='eb'),
  num.threads=2
)

Exploring the conclusions from the inlabru model

Now that we have fit our log-Gaussian Cox process model, what can we learn? What can we say about the whale intensity?

inlabru provides many user-friendly functions for answering these questions. Three such functions are spde.posterior, predict and pixels. Let’s see them in action.

Mapping the predicted whale intensity

As a first objective, let’s plot the predicted whale intensity across the domain. This plot can help to visually assess where the model predicts ‘hotspots’ or areas of high whale activity exist. The following workflow can help to make professional plots easily:

  • First, we must define SpatialPixels across our domain using the helper function pixels.
# Define pixels for plotting
plot_pixels <- pixels(mesh_land, mask = Domain)
# mask = Domain ensures we do not predict in our 'buffered' regions, but only the original Domain
  • Next, we must predict a user-specified function of the random components across the SpatialPixels using the function predict. The package inlabru (and INLA) allows for approximate Monte Carlo samples of each random component to be generated from the posterior distributions of the model. The function predict then computes the user-specified function for each sample before taking the empirical mean, standard deviation, quantiles, etc., of these values. This approach allows the posterior distribution of arbitrarily complex nonlinear functions of parameters to be investigated.

For the purposes of this workshop, we are going to use only 20 Monte Carlo samples. In practice, to reduce Monte Carlo error, this number may need to be >1000. The Monte Carlo standard error should always be reported. We set the seed so that the results are reproducible.

# Set random seed
seed <- c(12345)
# Predict the intensity defined by the function exp(mySpatial + Intercept) across the pixels
pred_int <- predict(fit, plot_pixels, ~ exp(mySpatial + Intercept), n.samples = 20, seed=seed)
# What summary statistics have been computed for the intensity?
names(pred_int)
## [1] "mean"   "sd"     "q0.025" "median" "q0.975" "smin"   "smax"   "cv"    
## [9] "var"

Notice how lots of useful summary statistics are computed by default using the predict function!

  • Finally, we can plot these prediction objects using gg. We will plot both the posterior mean (representing the predicted whale intensity) and the posterior standard deviation (representing the uncertainty around the prediction).
# Define a 'pretty' colour palette
colsc <- function(...) {
  scale_fill_gradientn(colours = rev(RColorBrewer::brewer.pal(11,"RdYlBu")),
                       limits = range(...))
}
# Plot the posterior mean intensity
ggplot() + gg(pred_int[1]) + gg(Domain) + gg.spatiallines_mod(Effort_survey, colour='red') +
  colsc(pred_int@data$mean)

# Plot the posterior SD (note the index [2])
ggplot() + gg(pred_int[2]) + gg(Domain) + gg.spatiallines_mod(Effort_survey, colour='red') +
  colsc(pred_int@data$sd)

From these plots, we see two interesting features. First, a very low whale intensity is predicted in the regions frequently visited by the whale-watching companies. Second, hotspots are identified in the western region where much of the survey effort was focused, and in the center of the study region.

Plotting posterior distributions of the parameters and the detection function

Next, let’s look at the posterior distributions of the spatial range and the marginal variance of the random field. Remember: the spatial range (in km) estimates the ‘approximate distance required between two points for the values of the intensity there to be approximately independent’. For this, we use the functions spde.posterior and plot.

spde.range <- spde.posterior(fit, "mySpatial", what = "range")
plot(spde.range)

spde.var <- spde.posterior(fit, "mySpatial", what = "variance")
plot(spde.var)

We can see that the model is estimating a spatial field with a reasonably large spatial range and small variance. No evidence of severe overfitting is shown (a very large variance with a tiny spatial range can be suspicious).

Next, we can plot the estimated detection probability function, with uncertainty intervals. For this, we use predict once again, with one slight modification. Instead of using pixels to define a set of SpatialPixels to plot over, we must manually specify a data.frame object with the desired values of distance to predict across. Then, we can use the function plot as before. Note: I have had some issues with using the predict function on non-spatial data.frames. If you get an error message, see Additional tips.

distdf <- data.frame(distance = seq(0.25, 2, length = 100))
dfun <- predict(fit, distdf, ~ exp(log_hn(distance, lsig)), n.samples = 20, seed=seed)
plot(dfun)

This shows the probability of detecting a whale (given that it is at the ‘surface’), as a function of distance from the aircraft. The posterior mean and 95% credible intervals are shown. Note that for a publication, n.samples may need to be much higher. To estimate the Monte Carlo standard error at a distance of 1km, we can do the following:

dfun$sd[50]/sqrt(20) #sqrt(20) by central limit theorem
## [1] 0.01824079

This is huge!
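The arithmetic behind this check can be turned into a quick planning tool. Below is a sketch; the posterior_sd value is hypothetical, back-calculated from the Monte Carlo standard error printed above:

```r
# The Monte Carlo SE of a posterior-mean estimate scales as sd/sqrt(n.samples).
posterior_sd <- 0.0816   # hypothetical posterior SD at distance 1 km
mc_se <- function(n) posterior_sd / sqrt(n)
mc_se(20)                # roughly the 0.018 reported above
# Number of samples needed to push the Monte Carlo SE below a target:
n_needed <- function(target) ceiling((posterior_sd / target)^2)
n_needed(0.005)          # hundreds of samples, not 20!
```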

Predicting the population size

inlabru provides functions for predicting the population size! That is, we can investigate the posterior distribution for the expected number of whales with the following caveats:

  • We are assuming that every encounter made was with a single individual

  • We are assuming that all individuals can be spotted at distance 0m!

  • We have not factored in availability bias (perhaps whales are diving and missed)

  • We have not factored in the influence of group size on detection probability!

  • We are assuming sufficient mixing time allowed between encounters

Ignoring these issues for now, we predict the total population size falling within the Domain. To achieve this, we use the function ipoints. This function sets a series of integration points with weights (areas), to allow us to approximate the integral of the intensity surface across the domain with a Riemann sum approximation. Basically, it approximates: \[\int_\Omega \mathbf{E}_Z \left( \mathbf{E}( \lambda_{true}(s) | Z) \right) ds\]

with \[\sum_{A_i} \mathbf{E}_Z \left( \mathbf{E}( \hat{\bar{\lambda}}_{true}(s_i) | Z) \right) |A_i|.\]

Note that ipoints creates a SpatialPointsDataFrame object with a single named variable weight containing the area of each region corresponding to each integration point.

predpts <- ipoints(Domain, mesh_land)
head(predpts$weight) # these are the areas of the regions corresponding to each point
## [1] 348.89742  39.76131  32.47272  88.08495 200.54656  47.12136
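The weighted-sum idea is just a Riemann sum. Here is a self-contained base-R sketch on the unit square with a toy intensity function (everything in this chunk is illustrative, not model output):

```r
# Toy intensity surface: approximate its integral by summing
# value * cell area, exactly as ipoints' weights are used with predict below.
intensity <- function(x, y) exp(-6 + 0.5 * x)
step <- 0.1
midpts <- seq(step/2, 1 - step/2, by = step)
grid <- expand.grid(x = midpts, y = midpts)
weight <- step^2   # area of each grid cell, analogous to predpts$weight
Lambda_hat <- sum(weight * intensity(grid$x, grid$y))
Lambda_hat   # close to the exact integral exp(-6)*2*(exp(0.5)-1)
```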

Then, we use predict, with the user-specified function equal to a weighted sum of the intensity. Notice that we have omitted the detection probability function log_hn. Thus, we are predicting across \(\Omega\), assuming perfect detectability.

# Define the weighted sum function
Lambda <- predict(fit, predpts, ~ sum(weight * exp(mySpatial + Intercept)), n.samples = 20, seed=seed)
Lambda
##       mean       sd   q0.025   median   q0.975     smin     smax        cv
## 1 714.5626 153.4043 473.1161 705.1022 967.7446 410.0004 989.7033 0.2146828
##        var
## 1 23532.89

The population estimate is approximately 715 individuals, with a large variance of approximately \(2.4\times10^{4}\).

There remains a potential problem with our inference. If we look back at the plot of our domain, we notice that it extends about 45km beyond our survey tracklines in the south-western direction. Thus, there is a risk of extrapolation error! Ideally, we should restrict our predictions to lie ‘close’ to our effort!

Exercise 1

Part A. Plot a map of the whale intensity, but colour grey all pixels that lie over 20km away from the nearest survey trackline.

Hint 1: To extract a subset of a SpatialPoints object X that lies within a 20km distance of another Spatial object Y, we run the following code:

X_subset <- X[which(apply(gWithinDistance(X,Y,dist = 20,byid = T),2,sum)>0),]

Hint 2: We will need to convert the SpatialPixels intensity object into a SpatialPoints before subsetting. To convert a SpatialPixels object A to a SpatialPoints object B run the following:

B <- as(A, 'SpatialPointsDataFrame')

If you get really stuck, feel free to show the answer code!

# Convert the SpatialPixels into a SpatialPoints object before subsetting
pred_int_points <- as(pred_int,'SpatialPoints')
# Subset to 20 km from Effort lines
pred_int_restricted <- 
  pred_int[which(apply(gWithinDistance(pred_int_points, Effort_survey,dist = 20,byid = T),2,sum)>0),]

# Plot the posterior mean
ggplot() + gg(pred_int_restricted[1]) + gg(Domain) + gg.spatiallines_mod(Effort_survey, colour='red') +
  colsc(pred_int_restricted@data$mean)

# Plot the standard deviation
ggplot() + gg(pred_int_restricted[2]) + gg(Domain) + gg.spatiallines_mod(Effort_survey, colour='red') +
  colsc(pred_int_restricted@data$sd)

Part B. Estimate the total number of whales in the region lying within 20km of the nearest trackline.

If you get really stuck, feel free to show the answer code!

# Subsetting the ipoints that contain the weights (area)
predpts_restricted <- predpts[
  which(apply(gWithinDistance(predpts,Effort_survey,dist = 20,byid = T),2,sum)>0),]

# Using predict to estimate the expected population size in the restricted region
Lambda_restricted <- predict(fit, predpts_restricted, ~ sum(weight * exp(mySpatial + Intercept)), n.samples = 20, seed=seed)

Lambda_restricted
##       mean       sd   q0.025   median   q0.975     smin     smax        cv
## 1 626.5267 132.0626 419.0442 601.1097 819.0772 361.9115 833.1353 0.2107853
##        var
## 1 17440.54

You should get a smaller population estimate of approximately 600 individuals and a smaller, but still large, variance of approximately \(2\times10^{4}\). Don’t worry if your answers are slightly different - we’ve only used 20 samples!

Adding covariates to an inlabru model

Adding environmental covariates to models is made easy in inlabru. All that is required is that each covariate is converted to class SpatialPixelsDataFrame and that the covariate data are complete. By ‘complete’, we mean that a non-missing value is available at every point within the region defined by the mesh boundary.

Let’s add the log_Depth covariate. To do this, we need to redefine the components and formula. First we define the updated components.

cmp_land2 <- ~ mySpatial(main = coordinates, model = matern_land) + 
  Intercept(1) + log_Depth(main = log_Depth, model='linear') +
  lsig(1)

Notice that we have defined a new variable called log_Depth. Using the argument main, we point the variable to the SpatialPixelsDataFrame object log_Depth. log_Depth defines the product of a parameter (e.g. \(\beta_1\)) and the value of the covariate log Depth evaluated at the correct spatial location (e.g. \(X(s)\)). Finally, we tell inlabru that we are fitting a linear effect.

Next, we redefine our new formula object:

formula_2 = coordinates + distance ~ mySpatial +
  Intercept + log_Depth +
  log_hn(distance, lsig) + 
  log(2)

The only change here, is the addition of the linear effect log_Depth to the linear predictor. For clarity, the linear predictor defined in formula_2 is:

\[\textrm{log}(\lambda_{true}(s)) = \beta_0 + \beta_1 X(s) + Z(s) \\ \textrm{log}(p(d)) = \textrm{log}(2) + \frac{-(d-0.25)^2}{2\sigma^2}\].

Thus, the expected encounter rate of an animal at location \(s\), a distance \(d\) away from the aircraft is:

\[\lambda_{obs}(s) = \lambda_{true}(s) p(d)\].

Note that \(\beta_1 X(s)\) is the contribution of log_Depth at position \(s\), \(Z(s)\) is the value of the GRF mySpatial evaluated at position \(s\), and \(\sigma\) is the exponential of lsig.

Now we can fit the model using the like/bru approach:

Step 1, define the likelihood

fit2 = like(data = Sightings_survey,
            samplers = Effort_survey,
            domain = list(
              coordinates = mesh_land,
              distance = INLA::inla.mesh.1d(seq(0.25, 2, length.out = 30))),
            formula = formula_2,
            family = 'cp')

Step 2, fit the model with bru.

This will take 2-5 min to run.

# Slow to run bru
fit2 <- bru(fit2, components = cmp_land2)
## iinla: Iteration 1 [max:10]
## iinla: Iteration 2 [max:10]
## iinla: Max deviation from previous: 81.7% of SD [stop if: <1%]
## iinla: Iteration 3 [max:10]
## iinla: Max deviation from previous: 4.87% of SD [stop if: <1%]
## iinla: Iteration 4 [max:10]
## iinla: Max deviation from previous: 9.46% of SD [stop if: <1%]
## iinla: Iteration 5 [max:10]
## iinla: Max deviation from previous: 13.9% of SD [stop if: <1%]
## iinla: Iteration 6 [max:10]
## iinla: Max deviation from previous: 0.875% of SD [stop if: <1%]
## iinla: Convergence criterion met, running final INLA integration with known theta mode.
## iinla: Iteration 7 [max:10]

We can then explore the results.

fit2$dic$dic # DIC is 811 - a marginal improvement
## [1] 810.7794
summary(fit2) # WAIC is 807 - a marginal improvement
## inlabru version: 2.2.4.9000
## INLA version: 21.01.08-1
## Components:
##   mySpatial: Model types main='spde', group='exchangeable', replicate='iid'
##   Intercept: Model types main='linear', group='exchangeable', replicate='iid'
##   log_Depth: Model types main='linear', group='exchangeable', replicate='iid'
##   lsig: Model types main='linear', group='exchangeable', replicate='iid'
## Likelihoods:
##   Family: 'cp'
##     Data class: 'SpatialPointsDataFrame'
##     Predictor: 
##         coordinates + distance ~ mySpatial + Intercept + log_Depth + 
##             log_hn(distance, lsig) + log(2)
## Time used:
##     Pre = 5.98, Running = 7.1, Post = 1.24, Total = 14.3 
## Fixed effects:
##             mean    sd 0.025quant 0.5quant 0.975quant   mode kld
## Intercept -5.116 0.791     -6.714   -5.100     -3.605 -5.069   0
## log_Depth -0.300 0.136     -0.560   -0.302     -0.025 -0.307   0
## lsig      -0.270 0.118     -0.516   -0.265     -0.052 -0.256   0
## 
## Random effects:
##   Name     Model
##     mySpatial SPDE2 model
## 
## Model hyperparameters:
##                        mean      sd 0.025quant 0.5quant 0.975quant    mode
## Range for mySpatial 282.196 150.165    107.788  244.377     672.27 189.982
## Stdev for mySpatial   0.923   0.256      0.516    0.891       1.51   0.831
## 
## Expected number of effective parameters(stdev): 23.55(0.00)
## Number of equivalent replicates : 494.46 
## 
## Deviance Information Criterion (DIC) ...............: 810.78
## Deviance Information Criterion (DIC, saturated) ....: NaN
## Effective number of parameters .....................: 21.64
## 
## Watanabe-Akaike information criterion (WAIC) ...: 806.87
## Effective number of parameters .................: 17.10
## 
## Marginal log-Likelihood:  -427.90 
## Posterior marginals for the linear predictor and
##  the fitted values are computed

So, we find that adding log_Depth to the model improves the model fit, as judged by DIC and WAIC. In the paper that developed the DIC method, Spiegelhalter et al. (2002) had the following to say: “Burnham and Anderson (1998) suggested models receiving AIC within 1–2 of the ‘best’ deserve consideration, and 3–7 have considerably less support: these rules of thumb appear to work reasonably well for DIC…”

Thus, given that we detect a difference in DIC of (slightly) over 3, we conclude that the model without log_Depth has considerably less support than the model with log_Depth included.

Furthermore, inspecting the output from summary(), we see that the 95% credible intervals do not cover zero. Thus, we detect a ‘significant’ negative effect of log_Depth. That is, we have evidence to suggest that the whales prefer shallower waters, after controlling for effort.
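The 'covers zero' check can be read straight off the quantile columns of the summary; a quick base-R sketch using the fixed-effect quantiles printed above:

```r
# 95% credible interval endpoints copied from summary(fit2) above.
fixed <- data.frame(
  row.names = c("Intercept", "log_Depth", "lsig"),
  q025 = c(-6.714, -0.560, -0.516),
  q975 = c(-3.605, -0.025, -0.052)
)

# An effect is 'significant' in the loose sense used here when its
# 95% credible interval excludes zero.
covers_zero <- fixed$q025 < 0 & fixed$q975 > 0
covers_zero  # FALSE for every row: no interval covers zero
```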

Exercise 2

Fit a model with a quadratic term for log_Depth. To achieve this, include log_Depth_sq alongside log_Depth. Does this improve the model fit? Are either of the variables (the linear or quadratic terms) found to be significantly associated with the whale intensity?

Hint: Remember that there are multiple steps.

Part A. Define the model.

cmp_land3 <- ~ mySpatial(main = coordinates, model = matern_land) + 
  Intercept(1) + log_Depth(main = log_Depth, model='linear') +
  log_Depth_sq(main = log_Depth_sq, model='linear') +
  lsig(1)

formula_3 = coordinates + distance ~ mySpatial +
  Intercept + log_Depth +
  log_Depth_sq +
  log_hn(distance, lsig) + 
  log(2)

Part B. Define the likelihood with like.

lik3 = like(data = Sightings_survey,
            samplers = Effort_survey,
            domain = list(
              coordinates = mesh_land,
              distance = INLA::inla.mesh.1d(seq(0.25, 2, length.out = 30))),
            formula = formula_3,
            family = 'cp')

Part C. Fit the model with bru.

fit3 <- bru(lik3, components = cmp_land3)
## iinla: Iteration 1 [max:10]
## iinla: Iteration 2 [max:10]
## iinla: Max deviation from previous: 1840% of SD [stop if: <1%]
## iinla: Iteration 3 [max:10]
## iinla: Max deviation from previous: 79700% of SD [stop if: <1%]
## iinla: Iteration 4 [max:10]
## iinla: Max deviation from previous: 2690000% of SD [stop if: <1%]
## iinla: Iteration 5 [max:10]
## iinla: Max deviation from previous: 67700% of SD [stop if: <1%]
## iinla: Iteration 6 [max:10]
## iinla: Max deviation from previous: 147000% of SD [stop if: <1%]
## iinla: Iteration 7 [max:10]
## iinla: Max deviation from previous: 3680000% of SD [stop if: <1%]
## iinla: Iteration 8 [max:10]
## iinla: Max deviation from previous: 31700% of SD [stop if: <1%]
## iinla: Iteration 9 [max:10]
## iinla: Max deviation from previous: 1240000% of SD [stop if: <1%]
## iinla: Maximum iterations reached, running final INLA integration.
## iinla: Iteration 10 [max:10]
fit3$dic$dic # DIC has blown up - see the convergence warnings above
## [1] 13283.98
summary(fit3)
## inlabru version: 2.2.4.9000
## INLA version: 21.01.08-1
## Components:
##   mySpatial: Model types main='spde', group='exchangeable', replicate='iid'
##   Intercept: Model types main='linear', group='exchangeable', replicate='iid'
##   log_Depth: Model types main='linear', group='exchangeable', replicate='iid'
##   log_Depth_sq: Model types main='linear', group='exchangeable', replicate='iid'
##   lsig: Model types main='linear', group='exchangeable', replicate='iid'
## Likelihoods:
##   Family: 'cp'
##     Data class: 'SpatialPointsDataFrame'
##     Predictor: 
##         coordinates + distance ~ mySpatial + Intercept + log_Depth + 
##             log_Depth_sq + log_hn(distance, lsig) + log(2)
## Time used:
##     Pre = 7.08, Running = 101, Post = 0.945, Total = 109 
## Fixed effects:
##                mean    sd 0.025quant 0.5quant 0.975quant   mode kld
## Intercept    -0.402 0.014     -0.430   -0.402     -0.374 -0.402   0
## log_Depth    -0.068 0.004     -0.076   -0.068     -0.061 -0.068   0
## log_Depth_sq -0.009 0.003     -0.014   -0.009     -0.004 -0.009   0
## lsig          0.055 0.009      0.037    0.055      0.074  0.055   0
## 
## Random effects:
##   Name     Model
##     mySpatial SPDE2 model
## 
## Model hyperparameters:
##                        mean    sd 0.025quant 0.5quant 0.975quant    mode
## Range for mySpatial 233.018 0.491    231.938  233.071     233.84 233.266
## Stdev for mySpatial   0.985 0.019      0.943    0.987       1.02   0.995
## 
## Expected number of effective parameters(stdev): 307.47(0.00)
## Number of equivalent replicates : 37.87 
## 
## Deviance Information Criterion (DIC) ...............: 13283.98
## Deviance Information Criterion (DIC, saturated) ....: NaN
## Effective number of parameters .....................: 127.68
## 
## Watanabe-Akaike information criterion (WAIC) ...: 13259.80
## Effective number of parameters .................: 102.04
## 
## Marginal log-Likelihood:  -11742.26 
## Posterior marginals for the linear predictor and
##  the fitted values are computed

In the run shown above, inlabru reports severe convergence issues: the ‘Max deviation from previous’ does not shrink towards 0% across iterations, and the reported DIC and WAIC are vastly inflated as a result. A converged fit of this model might suggest a quadratic trend between the log whale intensity and log depth, but given the convergence failure we should report results from this model with severe caution.

Comparing the output from competing models

So, we ultimately find that the model with a linear effect of log_Depth (fit2) offers the best fit to the survey dataset. The quadratic model may offer an improvement, but there appear to be convergence issues with fitting it, so we do not consider its results further. These convergence issues should come as no surprise: the survey dataset is quite limited in size, with only 63 sightings in total! In fact, an LGCP model should be used with extreme caution on such a small dataset.
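As an aside, one common remedy for convergence problems with polynomial terms is to centre (and scale) the covariate before squaring it, which breaks the strong correlation between the linear and quadratic columns. A base-R sketch of the idea (the toy x values below are illustrative; the real covariates live in SpatialPixelsDataFrame objects):

```r
x <- log(c(12, 35, 80, 150, 420, 900))  # toy log-depth values
cor(x, x^2)                              # raw terms: almost perfectly correlated

x_c <- scale(x)[, 1]                     # centre and scale first
cor(x_c, x_c^2)                          # correlation is greatly reduced
```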

The DIC values are very similar for fit and fit2 - the models without and with log_Depth included, respectively. Let’s see how the predictions and parameter estimates change under the second model. Are the conclusions drawn from the two models consistent?

plot_pixels2 <- pixels(mesh_land, mask = Domain)

pred_int2 <- predict(fit2, plot_pixels2, 
                     ~ exp(mySpatial + Intercept + log_Depth), n.samples = 20, seed=seed)

multiplot(ggplot() + gg(pred_int2[1]) + gg(Domain) + gg.spatiallines_mod(Effort_survey, colour='red') +
            colsc(c(pred_int2@data$mean,pred_int@data$mean)) + ggtitle('Covariate Model Mean'),
          ggplot() + gg(pred_int2[2]) + gg(Domain) + gg.spatiallines_mod(Effort_survey, colour='red') +
            colsc(c(pred_int2@data$sd,pred_int@data$sd)) + ggtitle('Covariate Model SD'),
          ggplot() + gg(pred_int[1]) + gg(Domain) + gg.spatiallines_mod(Effort_survey, colour='red') +
            colsc(c(pred_int2@data$mean,pred_int@data$mean)) + ggtitle('Covariate-Free Model Mean'),
          ggplot() + gg(pred_int[2]) + gg(Domain) + gg.spatiallines_mod(Effort_survey, colour='red') +
            colsc(c(pred_int2@data$sd,pred_int@data$sd)) + ggtitle('Covariate-Free Model SD'),
          layout = matrix(c(1:4),nrow=2,ncol=2,byrow = F))

# Look at parameters of spde model
spde.range2 <- spde.posterior(fit2, "mySpatial", what = "range")
spde.var2 <- spde.posterior(fit2, "mySpatial", what = "variance")
multiplot(plot(spde.range2)+ylim(range(spde.range$pdf)),plot(spde.range)+xlim(range(spde.range2$range)),
          plot(spde.var2)+ylim(range(spde.var$pdf)),plot(spde.var)+xlim(range(spde.var2$variance)),
          layout = matrix(1:4,2,2,byrow = F))

Next, we use predict and plot to display the estimated effect of log_Depth on the intensity. Note that the effect is nonlinear on the intensity scale, as the covariate is specified to be linear on the log intensity scale.

# plot the covariate effects
depthdf <- data.frame(log_Depth = seq(from=min(log_Depth@data$log_Depth),
                                      to=max(log_Depth@data$log_Depth),
                                      length.out = 100))
depthpred <- predict(fit2, depthdf, ~ exp(log_Depth), n.samples=40, seed=seed)
plot(depthpred)

# Manually if the predict function breaks
#samples <- generate(fit2, n.samples = 20, seed=seed)
# depthpred <- sapply(samples,FUN = function(x){return(depthdf$log_Depth*x$log_Depth)})
# ggplot(data.frame(mean=apply(depthpred,1, mean),
#                   LCL=apply(depthpred,1, quantile, probs=c(0.025)),
#                   UCL=apply(depthpred,1, quantile, probs=c(0.975)),
#                   logdepth=seq(from=min(log_Depth@data$log_Depth),
#                             to=max(log_Depth@data$log_Depth),
#                             length.out = 100)),
#        aes(y=mean,x=logdepth,ymax=UCL,ymin=LCL)) +
#   geom_line() + geom_ribbon(alpha=0.4)

Note that the credible intervals cover a large range of effect sizes: there is a large amount of uncertainty in the effect of log depth! However, the estimated effect size is quite large; the estimated whale intensity at a log depth of 5 is expected to be around a quarter of that at a log depth of 0!
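The ‘roughly a quarter’ figure is a quick back-of-the-envelope calculation from the posterior mean of the log_Depth effect (-0.30, from summary(fit2) above):

```r
beta <- -0.30  # posterior mean of the log_Depth effect from summary(fit2)

# On the log-intensity scale the effect is linear, so each extra unit
# of log depth multiplies the intensity by exp(beta). Over 5 units:
exp(beta * 5)  # ~0.22, i.e. roughly a quarter of the intensity
```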

Finally, we plot the estimated population size across both the entire domain and the restricted domain (restricted to points lying within 30km of the nearest survey trackline). We then plot the estimated detection function. Notice that the predict function fails to work for the variable distance! This is a bug in inlabru that will hopefully be corrected soon. In the meantime, we manually extract the results by sampling all the model components with the function generate and then combining them manually with sapply. Note that the variables LCL and UCL define a symmetric 95% credible interval.

# Plot detection probability function
distdf <- data.frame(distance = seq(0.25, 2, length = 100))
#dfun2 <- predict(fit2, distdf, ~exp(log_hn(distance,lsig)),n.samples = 20)
# If this throws an error, run the code below
samples <- generate(fit2, n.samples = 20, seed=seed)
dfun2 <- sapply(samples,FUN = function(x){return(exp(log_hn(distdf$distance,x$lsig)))})
ggplot(data.frame(mean=apply(dfun2,1, mean),
                  LCL=apply(dfun2,1, quantile, probs=c(0.025)),
                  UCL=apply(dfun2,1, quantile, probs=c(0.975)),
                  distance=distdf$distance),
       aes(y=mean,x=distance,ymax=UCL,ymin=LCL)) +
  geom_line() + geom_ribbon(alpha=0.4)

# We can look at the posterior for expected number of whales:
predpts <- ipoints(Domain, mesh_land)
# It looks like we do not need to do this!

Lambda2 <- predict(fit2, predpts, ~ sum(weight * exp(mySpatial + Intercept +
                                                       log_Depth)),
                   n.samples=20, seed=seed)
## Warning in input_eval.bru_input(component[[part]]$input, data, env = component$env, : Model input 'log_Depth' for 'main' returned some NA values.
## Attempting to fill in spatially by nearest available value.
## To avoid this basic covariate imputation, supply complete data.
Lambda2
##       mean      sd   q0.025   median   q0.975     smin     smax        cv
## 1 705.3013 142.495 520.2813 659.2864 992.1128 478.7026 1029.601 0.2020342
##        var
## 1 20304.82
Lambda
##       mean       sd   q0.025   median   q0.975     smin     smax        cv
## 1 714.5626 153.4043 473.1161 705.1022 967.7446 410.0004 989.7033 0.2146828
##        var
## 1 23532.89
# Avoid extrapolation
Lambda2_restricted <- predict(fit2, predpts_restricted, 
                              ~ sum(weight * exp(mySpatial + Intercept +
                                                 log_Depth)),
                              n.samples=20, seed=seed)
Lambda2_restricted
##       mean       sd   q0.025   median   q0.975     smin     smax        cv
## 1 632.1754 118.5818 457.6355 616.7085 878.0774 423.4721 908.9978 0.1875774
##        var
## 1 14061.65
Lambda_restricted
##       mean       sd   q0.025   median   q0.975     smin     smax        cv
## 1 626.5267 132.0626 419.0442 601.1097 819.0772 361.9115 833.1353 0.2107853
##        var
## 1 17440.54

We indeed see very similar estimates for the population size, with overlapping credible intervals. Remember that in a publication we would draw far more than 20 samples!
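Why draw more samples for publication? The Monte Carlo error in posterior summaries shrinks with the square root of the number of samples drawn. A rough sketch (the rnorm draws below are a stand-in posterior for Lambda, using mean and sd values similar to those above):

```r
set.seed(1)
draws <- rnorm(1e5, mean = 700, sd = 140)  # stand-in posterior for Lambda

# Monte Carlo standard error of the posterior-mean estimate
# as a function of the number of samples drawn:
mc_se <- function(n) sd(draws) / sqrt(n)

mc_se(20)    # ~31 whales of Monte Carlo error with 20 samples
mc_se(2000)  # ~3 whales with 2000 samples
```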

In this session, we have learned how to fit log-Gaussian Cox process models to survey datasets that followed distance sampling protocols within a Bayesian framework. We have seen how to compare models, how to predict intensity across space, and how to estimate population size.

Coming up next, we will see how to incorporate presence-only datasets into our models. Crucially, we will find that by incorporating presence-only datasets into our models, we may be able to uncover previously hidden species-environment relationships, and use these to improve the resolution of our maps of whale intensity!

Additional tips and code

Defining a mesh that ignores the coastline

An alternative ‘solution’ to the jagged mesh problem is to define the mesh within a much larger convex hull around the coastline. Unfortunately, this approach does not take the coastline into account and hence measures spatial ‘distance’ as Euclidean distance rather than ‘ocean distance’. One such mesh is defined below:

# Compare with a mesh that ignores the land barrier
mesh_noland <- inla.mesh.2d(boundary = Domain_restricted,
                          min.angle = 25,
                          max.edge = c(45,60),
                          cutoff = 30,
                          offset = c(10,20))
mesh_noland$crs <- Can_proj
plot(mesh_noland)

mesh_noland$n
## [1] 405

We ignore this mesh for the remainder of this session, since it fails to account for land as a barrier.

Problems with using predict on non-spatial data.frames

I have had some issues using the predict function on non-spatial data.frames. If you get an error message, try the generate/sapply approach shown below.

samples <- generate(fit, n.samples = 20, seed=seed)
distpred <- sapply(samples,FUN = function(x){return(exp(log_hn(distdf$distance,x$lsig)))})
ggplot(data.frame(probability=apply(distpred,1, mean),
                   LCL=apply(distpred,1, quantile, probs=c(0.025)),
                   UCL=apply(distpred,1, quantile, probs=c(0.975)),
                   distance=distdf$distance),
        aes(y=probability,x=distance,ymax=UCL,ymin=LCL)) +
   geom_line() + geom_ribbon(alpha=0.4)

saveRDS(fit,'./Model_Files/fit.rds')
saveRDS(fit2,'./Model_Files/fit2.rds')
rm(fit)
rm(fit2)
rm(fit3)
save.image('./Data/Modelling_Data2_New.RData')

References

Fuglstad, Geir-Arne, Daniel Simpson, Finn Lindgren, and Håvard Rue. 2019. “Constructing Priors that Penalize the Complexity of Gaussian Random Fields.” Journal of the American Statistical Association 114 (525): 445–52. https://doi.org/10.1080/01621459.2017.1415907.

Lindgren, Finn, and Håvard Rue. 2011. “An Explicit Link Between Gaussian Fields and Gaussian Markov Random Fields: The Stochastic Partial Differential Equation Approach.” Journal of the Royal Statistical Society B 73 (4): 423–98.

Simpson, Daniel, Håvard Rue, Andrea Riebler, Thiago G Martins, and Sigrunn H Sørbye. 2017. “Penalising model component complexity: a principled, practical approach to constructing priors.” Statistical Science 32 (1): 1–28. https://doi.org/10.1214/16-STS576.


  1. Alternatively we could use plot(mesh_land)↩︎